57 research outputs found

    Model Accuracy and Runtime Tradeoff in Distributed Deep Learning:A Systematic Study

    Full text link
    This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of synchronization protocol, stale gradient updates, minibatch size, learning rates, and number of learners on runtime performance and model accuracy. We introduce a new learning rate modulation strategy to counter the effect of stale gradients and propose a new synchronization protocol that can effectively bound the staleness in gradients, improve runtime performance and achieve good model accuracy. Our empirical investigation reveals a principled approach for distributed training of neural networks: the mini-batch size per learner should be reduced as more learners are added to the system to preserve the model accuracy. We validate this approach using commonly-used image classification benchmarks: CIFAR10 and ImageNet.Comment: Accepted by The IEEE International Conference on Data Mining 2016 (ICDM 2016

    Scaling Deep Learning on GPU and Knights Landing clusters

    Full text link
    The speed of deep neural networks training has become a big bottleneck of deep learning research and development. For example, training GoogleNet by ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training process, the current deep learning systems heavily rely on the hardware accelerators. However, these accelerators have limited on-chip memory compared with CPUs. To handle large datasets, they need to fetch data from either CPU memory or remote processors. We use both self-hosted Intel Knights Landing (KNL) clusters and multi-GPU clusters as our target platforms. From an algorithm aspect, current distributed machine learning systems are mainly designed for cloud systems. These methods are asynchronous because of the slow network and high fault-tolerance requirement on cloud systems. We focus on Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original EASGD used round-robin method for communication and updating. The communication is ordered by the machine rank ID, which is inefficient on HPC clusters. First, we redesign four efficient algorithms for HPC systems to improve EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD are faster \textcolor{black}{than} their existing counterparts (Async SGD, Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design Sync EASGD, which ties for the best performance among all the methods while being deterministic. In addition to the algorithmic improvements, we use some system-algorithm codesign techniques to scale up the algorithms. By reducing the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x speedup over original EASGD on the same platform. We get 91.5% weak scaling efficiency on 4253 KNL cores, which is higher than the state-of-the-art implementation

    Anti-dyslipidemic activity of acacia tortilis seed extract in alloxan-induced diabetic rats

    Get PDF
    Background: The present study was carried out to evaluate the anti-dyslipidemic activities of seed extract of acacia tortilis (ATE) in alloxan inducd diabetic rats.Methods: The Rats were divided into five groups of six animals each. Groups I and II received normal saline, group III received ATE in dose of 100 mg/kg body weight, group IV received ATE in dose of 200 mg/kg b.w.; and group V received standard drug pioglitazone dose 3 mg/kg b.w. Drugs were administered orally once a day for 30 days. At the end of 0th, 10th, 20th  and 30th day, blood was collected to analyse serum glucose, serum insulin, total cholesterol (TC), serum phospholipid (PL), serum triglyceride (TG), Free fatty acids (FFA) and High density lipoprotein (HDL).Results: The results has been showed that ATE in above doses significantly increase the serum insulin and HDL level but significantly decreased the elevated level of TC, PL, TG , FFA, LDL and VLDL. It also decreased the atherogenic index and coronary risk index level significantly which was comparable with the pioglitazone.Conclusions: It is concluded that the seed extract of acacia tortilis at the dose of 100 and 200 mg/kg body weight produced significant anti-dyslipidemic activity in alloxan-induced diabetic rats

    PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning

    Full text link
    With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging considering high computation and storage demands, specifically, if real-time performance with high accuracy is needed. Weight pruning of DNNs is proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained, accurate, but not hardware friendly; structured pruning is coarse-grained, hardware-efficient, but with higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNN on mobile devices with the help of a novel model compression technique (pattern-based pruning based on extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.Comment: To be published in the Proceedings of Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 20
    • …
    corecore